Geographical analysis of media flows

A multidimensional approach

Claude Grasland (Université de Paris (Diderot), FR 2007 CIST, UMR 8504 Géographie-cités)

Introduction

1 Corpus preparation

The aim of this section is to prepare a corpus of news related to a language and to one or several countries over a period of time. As an example, we prepare a corpus of news in French (fr) related to France (FRA), Belgium (BEL) and Algeria (DZA) over a period of two years, from 1st January 2014 to 31st December 2015. As news items include not only titles but also descriptions, we decide to break the descriptions into sentences and to keep a maximum of three sentences per news item, with a maximum of 100 tokens per sentence.

The data used in this example were collected by the research project ANR Geomedia and are free to use for scientific and pedagogical purposes only. The content of the news should not be used or disseminated without the agreement of the newspapers.

1.1 Selection of media

We import the data provided by each media outlet and combine them in a single data.frame. Then we select the columns of interest.

# Load the packages used in this document
  library(data.table)
  library(quanteda)
  library(stringr)
  library(ggplot2)
  library(knitr)
  
  # Load data with the function fread (fast) and the encoding UTF-8
  df1<-fread("data/source/fr_FRA_figaro_int.csv", encoding = "UTF-8")
  df1$media<-"fr_FRA_figaro"
  
  
  df2<-fread("data/source/fr_FRA_libera_int.csv", encoding = "UTF-8")
  df2$media<-"fr_FRA_libera"
  
  df3<-fread("data/source/fr_BEL_derheu_int.csv", encoding = "UTF-8")
  df3$media<-"fr_BEL_derheu"
  
  df4<-fread("data/source/fr_BEL_lesoir_int.csv", encoding = "UTF-8")
  df4$media<-"fr_BEL_lesoir"
  
  df5<-fread("data/source/fr_DZA_elwata_int.csv", encoding = "UTF-8")
  df5$media<-"fr_DZA_elwata"
  
  df6<-fread("data/source/fr_DZA_xpress_int.csv", encoding = "UTF-8")
  df6$media<-"fr_DZA_xpress"
  
  # combine all media in a single data.table
  df<-rbind(df1,df2,df3,df4,df5,df6)
  rm(df1,df2,df3,df4,df5,df6)
  
  
  # select column of interest
  df$id <- df$ID_Item
  df$who <- df$media
  df$when <- df$Date_Recup
  df$text <- paste(df$Titre," . ", df$Description, sep="")
  df<-df[,c("id","who","when","text")]
  df<-df[order(when),]
  
  # select period of interest
  mintime<-as.Date("2014-01-01")
  maxtime<-as.Date("2015-12-31")
  df<-df[!is.na(df$when),] 
  df<-df[as.Date(df$when) >= mintime,]
  df<-df[as.Date(df$when) <= maxtime,]
  
  # eliminate duplicated texts
  df<-df[!duplicated(df$text),]

1.2 Check of time frequency

1.2.1 Time divisions

We transform the previous data.frame into the data.table format for easier aggregation operations.

dt<-as.data.table(df)
  dt$day     <- as.Date(dt$when)
  dt$week    <- cut(dt$day, "weeks", start.on.monday=TRUE)
  dt$month   <- cut(dt$day, "months")
  dt$weekday <- weekdays(dt$day)
  
  # Save data frame
  saveRDS(dt,"data/corpus/dt_mycorpus.RDS") 
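The way cut() builds the time divisions can be checked on a small vector of dates (a minimal sketch using base R only; the dates are arbitrary):

```r
# Three dates spanning two ISO weeks of January 2014
days <- as.Date(c("2014-01-01", "2014-01-05", "2014-01-06"))

# Each date is mapped to the Monday opening its week...
weeks <- cut(days, "weeks", start.on.monday = TRUE)
# ...and to the first day of its month
months <- cut(days, "months")

as.character(weeks)   # "2013-12-30" "2013-12-30" "2014-01-06"
as.character(months)  # "2014-01-01" "2014-01-01" "2014-01-01"
```

Note that 2014-01-01 was a Wednesday, so its week starts on Monday 2013-12-30.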

1.2.2 News by week

We examine whether the distribution of news by week is regular for the different media of the corpus.

dt<-readRDS("data/corpus/dt_mycorpus.RDS")
  news_weeks<-dt[,.(newstot=.N),by=.(week,who)]
  
  p<-ggplot(news_weeks, aes(x=as.Date(week),y=newstot, col=who))+
     geom_line()+
     geom_smooth(method = 'loess', formula = 'y~x')+
     scale_y_continuous("Number of news", limits = c(0,NA)) +
     scale_x_date("Week (starting on Monday)") +
           ggtitle(label ="Corpus : distribution of news by week",
                    subtitle = "1st Jan 2014 to 31st Dec. 2015")
  p

1.2.3 News by weekday

We examine whether the distribution is regular by weekday and check in particular the effect of the weekend.

#compute frequencies by weekday
  news_weekdays<-dt[,.(newstot=.N),by=.(weekday,who)]
  news_weekdays<-news_weekdays[,.(weekday,newspct=100*newstot/sum(newstot)),by=.(who)]
  
  
  # Translate weekdays into English and order them
  # (the French names returned by weekdays() sort alphabetically as:
  #  dimanche, jeudi, lundi, mardi, mercredi, samedi, vendredi)
  news_weekdays$weekday<-as.factor(news_weekdays$weekday)
  levels(news_weekdays$weekday)<-c("7.Sunday","4.Thursday","1.Monday","2.Tuesday","3.Wednesday","6.Saturday","5.Friday")
  news_weekdays$weekday<-as.factor(as.character(news_weekdays$weekday))
  news_weekdays<-news_weekdays[order(news_weekdays$weekday),]
  
  
  p<-ggplot(news_weekdays, aes(x=weekday,fill = who, y=newspct))+
           geom_bar(position = "dodge", stat="identity")+
           scale_y_continuous("Share of news (%)", limits = c(0,NA)) +
           ggtitle(label ="Corpus : distribution of news by week day",
                    subtitle = "1st Jan 2014 to 31st Dec. 2015")
  p

1.3 Transform into a quanteda corpus

1.3.1 Reshape news by sentences

The aim of this step is to harmonize the length of the texts collected through RSS. We decide to keep only the title of each news item and the first three sentences of the description, when they are available. The result is stored in quanteda format.

Unfortunately, the division of texts into sentences realized by quanteda is far from perfect, owing to problems in the collection of news. For example, the following text will be considered as a single sentence because the period is not followed by a blank character.

Le conflit est terminé.Mais la Russie est-elle d’accord avec la Turquie.

It is necessary to apply a regular expression that cleans the text by adding a blank space " " after each period located between a lower-case and an upper-case character:

str_replace_all(txt, "(?<=[:lower:])\\.(?=[:upper:])", "\\. ")

in order to obtain a text that will be recognised as made of two sentences:

Le conflit est terminé. Mais la Russie est-elle d’accord avec la Turquie.
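The insertion of the blank space can be checked directly on the example above (a minimal sketch using stringr; the variable name `txt` is illustrative):

```r
library(stringr)

# A title-description pair where the period lacks its trailing space
txt <- "Le conflit est terminé.Mais la Russie est-elle d'accord avec la Turquie."

# Insert a space after any period squeezed between a lower-case
# and an upper-case character (lookbehind/lookahead keep the period itself)
fixed <- str_replace_all(txt, "(?<=[:lower:])\\.(?=[:upper:])", "\\. ")
fixed
```

The ICU character classes used by stringr ([:lower:], [:upper:]) also match accented letters such as "é", which is why the pattern works on French text.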

Some sentences appear too short or too long for a sound analysis. Therefore, we decide to eliminate outliers based on the number of tokens (quantiles could also be used, as shown in the commented lines below). In practice, we eliminate the sentences with more than 100 tokens or fewer than 3 tokens.

t1<-Sys.time()
  dt<-readRDS("data/corpus/dt_mycorpus.RDS")
  
  
  
  # clean sentences break (long !)
  dt$text<-str_replace_all(dt$text,"(?<=[:lower:])\\.(?=[:upper:])", "\\. ")
  
  
  # transform in quanteda
  qd<-corpus(dt,docid_field = "id",text_field = "text")
  
  
  
  # break in sentences
  qd<-corpus_reshape(qd,to="sentences", use_docvars=T)
  
  # Identify rank of sentences
  qd$order<-as.numeric(as.data.frame(str_split(names(qd),"\\.", simplify=T))[,2])
  
  # Select only title + maximum of 3 sentences
  qd<-corpus_subset(qd, order < 5)
  
  # filter by number of tokens by sentence (texts() is deprecated in recent quanteda)
  qd$nbt<-ntoken(as.character(qd))
  #mintok<-quantile(qd$nbt,0.01)
  #maxtok<-quantile(qd$nbt,0.99)
  #qd<-corpus_subset(qd, nbt>mintok)
  qd<-corpus_subset(qd, nbt<100)
  qd<-corpus_subset(qd, nbt>2)
  
  
  
  # Save corpus in qd format
  saveRDS(qd,"data/corpus/qd_mycorpus.RDS")
  
  t2<-Sys.time()
  paste("Program executed in ", t2-t1)
  
  head(qd)
  summary(qd,3)

1.3.2 Number of sentences by media

We check the number of sentences available by media: the title (order 1) and the sentences of the description (orders 2 to 4).

qd<-readRDS("data/corpus/qd_mycorpus.RDS")
  x<-data.table(docvars(qd))
  
  
  tab<-x[,.(tot=.N),by=.(who,order)]
  tab<-dcast(tab,order~who)
  tab$order<-as.factor(tab$order)
  levels(tab$order)<-c("Title","Sent1","Sent2","Sent3")
  kable(tab, caption = "Distribution of title and sentences by media")
Distribution of title and sentences by media
order fr_BEL_derheu fr_BEL_lesoir fr_DZA_elwata fr_DZA_xpress fr_FRA_figaro fr_FRA_libera
Title 6994 10815 2896 4794 9449 13703
Sent1 6962 10730 2925 4755 9423 11227
Sent2 1884 3517 2893 249 3591 3453
Sent3 518 995 2867 16 536 444
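The wide table above is produced by data.table's dcast(); a toy example (with invented counts) shows the reshaping from long to wide format:

```r
library(data.table)

# Long format: one row per (order, media) pair, as returned by the aggregation
tab <- data.table(order = c(1, 1, 2),
                  who   = c("mediaA", "mediaB", "mediaA"),
                  tot   = c(10, 12, 7))

# Wide format: one column per media; missing pairs become NA
wide <- dcast(tab, order ~ who, value.var = "tot")
wide
```

In the chunk above, value.var is inferred automatically because tot is the only remaining column; it is spelled out here for clarity.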

1.3.3 Size of texts by month

We visualize the distribution of sentences of different orders through time in order to prepare a decision on the length of text to be kept.

tab<-x[,.(tot=.N),by=.(month,order)]
  tab$month<-as.Date(tab$month)
  tab$order<-as.factor(tab$order)
  levels(tab$order)<-c("Title","Sent1","Sent2","Sent3")
  
         
  p<-ggplot(tab, aes(x=month,fill = order, y=tot))+
    geom_bar(stat="identity")+
    ggtitle(label ="Corpus : distribution of titles and sentences by month",
            subtitle = "1st Jan 2014 to 31st Dec. 2015")
  p

2 Geographical tags

The aim of this section is to add to the quanteda corpus different metadata related to the geographical entities that are mentioned in the news. We do not discuss here the problems related to the choice of a list of entities; we simply apply a recognition method based on a dictionary. We distinguish between three categories of geographical entities:

  • states: recognized through a combination of country names and capital cities
  • geographical regions: based on continents or other “natural” features like seas, topography, biogeography, …
  • international organizations: based on a list established from Wikipedia.

More details on the methodology for the creation of dictionaries are discussed in the media cookbook.

2.1 Preparation of data

2.1.1 Load list and definition of entities

#Load dictionary
  ent<-readRDS("data/dict/worldgeo_def_V1.RDS")
  
  # Select language of definition
  ent <- ent[ent$lang=="fr",-1]
  
  # Eliminate duplicated labels
  ent <- ent[duplicated(ent)==F,]
  ent <- ent[duplicated(ent$label)==F,]
  ent <- ent[duplicated(ent$code)==F,]
  # Visualize
  head(ent)
   lang       label type   code
  1    fr     Afrique cont CO_AFR
  6    fr    Amérique cont CO_AMR
  11   fr        Asie cont CO_ASI
  16   fr      Europe cont CO_EUR
  21   fr     Océanie cont CO_OCE
  26   fr Antarctique cont CO_ANT

2.1.2 Load dictionary

We start by loading the latest version of the Imageun dictionary and we extract our target language (here: French).

#Load dictionary
  dict<-readRDS("data/dict/worldgeo_dict_V4.RDS")
  
  # Eliminate wikipedia codes
  dict <- dict[dict$lang=="fr",-1]
  
  # Eliminate duplicated labels
  dict <- dict[duplicated(dict)==F,]
  dict <- dict[duplicated(dict$label)==F,]
  
  # Check if all codes are available
  dict <- dict[dict$code %in% ent$code,]
  
  # Visualize
  head(dict)
   lang                   label type   code
  1    fr la République gabonaise  sta ST_GAB
  2    fr                   Gabon  sta ST_GAB
  11   fr    République gabonaise  sta ST_GAB
  15   fr                  Gambie  sta ST_GMB
  16   fr la République de Gambie  sta ST_GMB
  26   fr                  Guinée  sta ST_GIN

2.1.3 Load corpus

qd <- readRDS("data/corpus/qd_mycorpus.RDS")

2.1.4 Load tagging function

extract_tags <- function(qd = qd,                      # the corpus of interest
                           lang = "fr",                  # the language to be used
                           dict = dict,                  # the dictionary of target 
                           code = "id" ,                  # variable used for coding
                           tagsname = "tags",                 # name of the tags column
                           split  = c("'","’","-"),       # split list
                           tolow = FALSE  ,                # convert to lower case
                           comps = c("Afrique du sud")  # compounds
                           )
  { 
  
  
    
  # Tokenize  
  x<-as.character(qd)
  
  
  if(length(split) > 0) { reg<-paste(split, collapse = '|')
                         x <- gsub(reg," ",x)}  
  if(tolow) { x <- tolower(x)} 
  toks<-tokens(x)
  
  # compounds
  if(length(split) > 0) { reg<-paste(split, collapse = '|')
                         comps<- gsub(reg," ",comps)}  
  if(tolow)       {comps <- tolower(comps)}  
  toks<-tokens_compound(toks,pattern=phrase(comps))
  
    
  # Load dictionaries and create compounds
  
    ## Target dictionary
  dict<-dict[dict$lang==lang & is.na(dict$label)==F,]
  labels <-dict$label
  if(length(split) > 0) { reg<-paste(split, collapse = '|')
                         labels<- gsub(reg," ",labels)}  
  if(tolow)       {labels <- tolower(labels)}  
  toks<-tokens_compound(toks,pattern=phrase(labels))
    
   # create quanteda dictionary
  keys <-gsub(" ","_",labels)
  qd_dict<-as.list(keys)
  names(qd_dict)<-dict[[code]]
  qd_dict<-dictionary(qd_dict,tolower = FALSE)
  
  # Identify geo tags (states or reg or org ...)
  toks_tags <- tokens_lookup(toks, qd_dict, case_insensitive = F)
  toks_tags <- lapply(toks_tags, unique)
  toks_tags<-as.tokens(toks_tags)
  list_tags<-function(x){res<-paste(x, collapse=' ')}
  docvars(qd)[[tagsname]]<-as.character(lapply(toks_tags,FUN=list_tags))
  docvars(qd)[[paste("nb",tagsname,sep="")]]<-ntoken(toks_tags)
  
  
  
  # Export results
  return(qd)
   }
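The core of extract_tags() is quanteda's tokens_lookup(), whose behaviour can be checked on a toy corpus (the sentence, labels and codes below are invented for illustration):

```r
library(quanteda)

# One toy document and a three-entry dictionary (codes as keys, labels as values);
# tolower = FALSE preserves the case of the labels, as in extract_tags()
toks <- tokens("Sommet entre la France et le Maroc à Paris")
geo_dict <- dictionary(list(ST_FRA = "France",
                            ST_MAR = "Maroc",
                            CA_FRA = "Paris"),
                       tolower = FALSE)

# Replace each recognized label by its code, dropping all other tokens
found <- tokens_lookup(toks, geo_dict, case_insensitive = FALSE)
as.list(found)[[1]]  # "ST_FRA" "ST_MAR" "CA_FRA"
```

Multi-word labels are handled upstream in extract_tags() by tokens_compound(), which joins them with "_" before the lookup.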

2.2 Geographical annotation

2.2.1 Annotate all entities

In a first step, we annotate all geographic entities together in order to benefit from the cross-definition of their respective compounds. We will separate them by subcategories in a second step.

# Less than 5 minutes for the tagging of 3 million sentences on a good PC.  
  
  t1<-Sys.time()
  
  frcomps<-c("Europe 1", "Atlantic city", "Nantes-Atlantique",
             "Loire-Atlantique", "Pyrénées-Atlantique", "Pyrénées-Atlantiques",
             "Alpes-de-Haute-Provence", "Hautes-Alpes", "Rhône-Alpes","Alpes-Maritimes",
             "Chantiers de l'Atlantique", "TGV Atlantique",
             "Bourse de Paris", "Paris SG", "Ville de Paris", "Grand Paris")
  
  qd <- extract_tags (qd = qd,
                       lang="fr",
                       dict = dict,
                       code = "code",
                       tagsname = "geo",
                       split = c("'","’","-"),
                       comps = frcomps,
                       tolow = FALSE)
  
  t2 = Sys.time()
  paste("Program executed in ", t2-t1)
  
  table(qd$nbgeo)

      0     1     2     3     4     5     6     7     8    11    12    13    22 
  68078 37910  7779  1444   296    73    26    18     6     1     3     1     1 

2.2.2 Extract state codes

state<-ent$code[ent$type %in% c("sta","cap")]
  test <- paste(state, collapse="|")
  x<-as.character(lapply(str_extract_all(qd$geo,paste(test, collapse = '|')), paste,collapse=" "))
  x<-gsub("ST_","",x)
  x<-gsub("CA_","",x)
  y<-tokens(x)
  y<-lapply(y, unique)
  list_tags<-function(x){res<-paste(x, collapse=' ')}
  docvars(qd)[["states"]]<-as.character(lapply(y,FUN=list_tags))
  docvars(qd)[["nbstates"]]<-ntoken(qd$states)
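The extraction above relies on str_extract_all() with an alternation of codes, followed by prefix stripping; a toy example (with a fake tag string) illustrates the steps:

```r
library(stringr)

# Fake "geo" tags for two sentences (the second one has no tag)
geo <- c("ST_FRA CA_RUS CO_EUR", "")

# Keep only state and capital codes, then strip the ST_/CA_ prefixes
state_codes <- c("ST_FRA", "ST_DZA", "CA_RUS")
pattern <- paste(state_codes, collapse = "|")
x <- sapply(str_extract_all(geo, pattern), paste, collapse = " ")
x <- gsub("ST_|CA_", "", x)
x  # "FRA RUS" ""
```

Stripping both prefixes to the same bare ISO code is what merges mentions of a country and of its capital into a single state tag.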

2.2.3 Check news with the maximum number of states

table(qd$nbstates)

      0     1     2     3     4     5     6     7     8    10    11    12    22 
  71986 37289  5373   763   133    50    17    17     2     1     2     2     1 
check<-corpus_subset(qd,nbstates>7)
  x<-data.frame(who=check$who,when = check$when,text=as.character(check),states=check$states,nbstates=check$nbstates)
  x<-x[order(x$nbstates,decreasing = T),]
  kable(x)
who when text states nbstates
9845355 fr_DZA_elwata 2015-10-02 11:31:33 Iran: 464 morts, Egypte : 75, Nigeria : 64, Mali : 60, Indonésie : 57, Inde : 45 morts, Pakistan : 46, Niger : 22, Cameroun : 20, Côte d’Ivoire : 14, Tchad : 11, Algérie : 11, Sénégal : 10, Maroc : 10, Somalie : 8, Libye : 4, Tanzanie : 4, Kenya : 3, Tunisie : 2 et Burkina Faso, Burundi et Pays-Bas : 1 mort chacun. IRN EGY NGA MLI IDN IND PAK NER CMR CIV TCD DZA SEN MAR SOM LBY TZA KEN TUN BFA BDI NLD 22
148030 fr_DZA_elwata 2014-01-13 11:47:24 La réunion des ministres des Affaires étrangères des onze (Grande-Bretagne, Allemagne, Italie, France, Arabie Saoudite, Emirats arabes unis, Qatar, Egypte, Jordanie, Etats-Unis, Turquie) s’est ouverte, hier à Paris, en présence du président de la Coalition, Ahmad Jarba, pour discuter de la participation toujours incertaine de l’opposition à la conférence de paix sur la Syrie à partir du 22 janvier en Suisse. GBR DEU ITA FRA SAU QAT EGY JOR USA TUR SYR CHE 12
7377071 fr_DZA_elwata 2015-06-10 11:57:29 Durant cette rencontre, il est attendu environ 70 000 visiteurs professionnels venus des pays suivants : Pérou, France, Pays-Bas, Etats-Unis, Chine,  Hongrie, Espagne, Croatie, Russie, Danemark, Japon et Allemagne. PER FRA NLD USA CHN HUN ESP HRV RUS DNK JPN DEU 12
131590 fr_DZA_elwata 2014-01-10 10:56:30 Les 11 pays du groupe (Etats-Unis, Grande-Bretagne, France, Allemagne, Italie, Turquie, Egypte, Jordanie, Qatar, Arabie Saoudite et Emirats arabes unis) discuteront de questions relatives à la Syrie avant la conférence de Genève 2, prévue pour le 22 janvier prochain, a indiqué la porte-parole du département d’Etat, Jen Psaki. USA GBR FRA DEU ITA TUR EGY JOR QAT SAU SYR 11
7360024 fr_DZA_elwata 2015-06-09 17:57:27 L’Algérie, l’Egypte, la Tunisie, le Maroc, les 5+5 de la rive nord de la Méditerranée (Espagne, Italie, France, Portugal, Malte) et les Etats-Unis mènent depuis le mois de mars un embargo maritime dans les ports occupés par l’organisation de l’Etat islamique en Libye. DZA EGY TUN MAR ESP ITA FRA PRT MLT USA LBY 11
8657790 fr_DZA_elwata 2015-08-08 11:36:14 Les contentieux en mer de Chine méridionale sont, cette année encore, au centre des réunions de l’Association des nations d’Asie du Sud-Est (Asean), qui regroupe dix pays, à savoir la Malaisie, la Thaïlande, le Singapour, le Vietnam, l’Indonésie, les Philippines, le Laos, le Cambodge, la Birmanie et le sultanat de Brunei. MYS THA SGP VNM IDN PHL LAO KHM MMR BRN 10
3410181 fr_BEL_derheu 2014-10-25 14:24:57 L’OMS divise en deux groupes les pays touchés, distinguant les pays les plus gravement atteints (Guinée, Liberia, Sierra Leone) et ceux ayant dénombré un ou quelques cas (Mali, Nigeria, Sénégal, Espagne et États-Unis). GIN LBR SLE MLI NGA SEN ESP USA 8
7483367 fr_DZA_elwata 2015-06-15 11:33:44 Les pays visés par la CPI sont le Soudan, le Kenya, la Côte d’Ivoire, la Libye, la RDC, la Centrafrique, l’Ouganda et le Mali. SDN KEN CIV LBY COD CAF UGA MLI 8

2.2.4 Extract world region codes

We do not distinguish between so-called “geographical” regions (like “Europe”) and “political” regions (like the “European Union”): we put them in the same category of world regions, i.e. the first level of organization under the world level and/or the first level of aggregation above the state level.

region<-ent$code[ent$type %in% c("sea","land","cont","org")]
  test <- paste(region, collapse="|")
  x<-as.character(lapply(str_extract_all(qd$geo,paste(test, collapse = '|')), paste,collapse=" "))
  y<-tokens(x)
  y<-lapply(y, unique)
  list_tags<-function(x){res<-paste(x, collapse=' ')}
  docvars(qd)[["regions"]]<-as.character(lapply(y,FUN=list_tags))
  docvars(qd)[["nbregions"]]<-ntoken(qd$regions)
  table(qd$nbregions)

2.2.5 Check news with maximum number of world regions

table(qd$nbregions)

       0      1      2      3      4 
  109289   6009    324     12      2 
check<-corpus_subset(qd,nbregions>2)
  x<-data.frame(who=check$who,when = check$when,text=as.character(check),regions=check$regions,nbregions=check$nbregions)
  x<-x[order(x$nbregions,decreasing = T),]
  kable(x)
who when text regions nbregions
3973331 fr_DZA_xpress 2014-12-13 10:40:51 Après l’Europe, le Moyen-Orient et l’Asie qui ont déjà le leur depuis des années, l’Afrique inaugure à son tour lundi un Forum sur la sécurité destiné à mobiliser le continent face aux défis du jihadisme et du développement. CO_EUR LA_east_middle CO_ASI CO_AFR 4
8955158 fr_DZA_elwata 2015-08-22 15:18:15 Plus de 104.000 migrants d’Afrique, du Proche-Orient et d’Asie du Sud ont atteint les ports d’Italie méridionale depuis le début de cette année après avoir été secourus en Méditerranée. CO_AFR LA_east_near CO_ASI SE_medit 4
578954 fr_BEL_lesoir 2014-03-13 09:40:55 Le Moyen-Orient, l’Asie du Sud et l’Afrique subsaharienne sont particulièrement concernés par le phénomène. LA_east_middle CO_ASI CO_AFR 3
1496189 fr_FRA_libera 2014-05-05 15:35:00 Asie, Moyen-Orient, Afrique centrale : plusieurs foyers de propagation suscitent l’inquiétude de l’organisation. CO_ASI LA_east_middle CO_AFR 3
3020740 fr_FRA_figaro 2014-09-22 21:37:47 Washington s’inquiète beaucoup d’un mouvement terroriste dont les membres seraient des cadres d’al-Qaida originaires du Moyen-Orient, de l’Asie du Sud et d’Afrique du Nord. LA_east_middle CO_ASI CO_AFR 3
3113264 fr_BEL_lesoir 2014-09-30 18:20:20 «Le bilan en Méditerranée s’alourdit tandis que l’Europe regarde ailleurs», dénonce l’ONG. Entretien avec John Dalhuisen, le responsable du programme Europe et Asie centrale d’Amnesty. SE_medit CO_EUR CO_ASI 3
3190462 fr_DZA_elwata 2014-10-07 11:17:51 «L’ONUDC estime que les deux principaux itinéraires de trafic illicite d’Afrique du Nord, de l’Est et de l’Ouest vers l’Europe, et d’Amérique du Sud vers l’Amérique du Nord, génèrent près de 7 milliards de dollars», a indiqué à la presse Yuri Fedotov, le directeur de l’institution basée à Vienne. CO_AFR CO_EUR CO_AMR 3
3327817 fr_DZA_elwata 2014-10-18 17:19:15 "“Le Conseil de sécurité de l’Onu devrait élargir le mandat de la Minurso  afin d’y inclure l’observation et la publication de rapports sur la situation  des droits humains, tant au Sahara occidental (occupé par le Maroc) que dans  les camps gérés par le Polisario en Algérie”", recommande HRW dans son rapport,  présenté par son directeur-adjoint chargé du Moyen-Orient et de l’Afrique du  Nord, Eric Goldstein, lors d’une conférence de presse. LA_sahara LA_east_middle CO_AFR 3
4290063 fr_BEL_derheu 2015-01-13 18:56:53 "“Nous parlons de quelque 3.000 à 5.000 citoyens de l’UE (Union européenne)”", a déclaré Rob Wainwright interrogé par la commission des affaires intérieures du Parlement britannique sur le nombre de personnes qui ont quitté l’Europe pour aller se battre au Proche-Orient. OR_EU CO_EUR LA_east_near 3
4999241 fr_FRA_figaro 2015-02-21 21:15:40 Servant de voie de passage entre l’Asie orientale, l’océan Indien et l’Europe, ils recèleraient d’importantes réserves d’hydrocarbures et halieutiques. CO_ASI SE_india CO_EUR 3
7360017 fr_BEL_lesoir 2015-06-09 17:57:23 Le vice-ministre des Affaires étrangères cubains, à Bruxelles pour le sommet UE-Amérique latine et Caraïbes, détaille les grands enjeux et points de friction. OR_EU CO_AMR SE_carai 3
7397493 fr_BEL_lesoir 2015-06-11 08:57:55 Deux jours pour tisser des liens plus profonds entre l’Union européenne, l’Amérique latine et les Caraïbes OR_EU CO_AMR SE_carai 3
9292003 fr_DZA_elwata 2015-09-07 11:17:14 La côte méditerranéenne de l’Egypte et celle de la Libye voisine sont régulièrement le point de départ de bateaux affrétés par des passeurs pour transporter vers l’Europe des migrants, certains fuyant les zones de conflits du Moyen-Orient,  comme la Syrie mais aussi la misère de certains pays d’Afrique. CO_EUR LA_east_middle CO_AFR 3
9885562 fr_DZA_elwata 2015-10-04 11:31:45 Cette tentative intervient tandis que l’UE fait face depuis plusieurs mois à l’arrivée massive d’hommes, de femmes et d’enfants pour la plupart en provenance de zones de conflit au Proche-Orient et en Afrique. OR_EU LA_east_near CO_AFR 3

2.2.6 Check news mixing more than 2 states and more than 1 world region

check<-corpus_subset(qd,nbregions>1 & nbstates >2)
  x<-data.frame(who=check$who,when = check$when,text=as.character(check),geo=check$geo,nbstates=check$nbstates, nbregions = check$nbregions)
  x<-x[order(x$nbstates*x$nbregions,decreasing = T),]
  kable(x)
who when text geo nbstates nbregions
8657790 fr_DZA_elwata 2015-08-08 11:36:14 Les contentieux en mer de Chine méridionale sont, cette année encore, au centre des réunions de l’Association des nations d’Asie du Sud-Est (Asean), qui regroupe dix pays, à savoir la Malaisie, la Thaïlande, le Singapour, le Vietnam, l’Indonésie, les Philippines, le Laos, le Cambodge, la Birmanie et le sultanat de Brunei. SE_china_south CO_ASI_south_east ST_MYS ST_THA ST_SGP ST_VNM ST_IDN ST_PHL ST_LAO ST_KHM ST_MMR ST_BRN 10 2
4053358 fr_DZA_elwata 2014-12-20 10:59:39 Le Groupe des cinq du Sahel – G5 Sahel : Tchad, Mali, Niger, Mauritanie, Burkina Faso – «lance un appel au Conseil de sécurité des Nations unies pour la mise en place, en accord avec l’Union africaine, d’une force internationale pour neutraliser les groupes armés, aider à la réconciliation nationale et mettre en place des institutions démocratiques stables» en Libye, est-il écrit dans le communiqué. LA_sahel ST_TCD ST_MLI ST_NER ST_MRT ST_BFA OR_AfrUn ST_LBY 6 2
9292003 fr_DZA_elwata 2015-09-07 11:17:14 La côte méditerranéenne de l’Egypte et celle de la Libye voisine sont régulièrement le point de départ de bateaux affrétés par des passeurs pour transporter vers l’Europe des migrants, certains fuyant les zones de conflits du Moyen-Orient,  comme la Syrie mais aussi la misère de certains pays d’Afrique. ST_EGY ST_LBY CO_EUR LA_east_middle ST_SYR CO_AFR 3 3
9012333 fr_FRA_figaro 2015-08-25 11:38:24 EN IMAGES - Depuis janvier, plus de 125.000 migrants sont entrés en Union européenne par la Hongrie en passant par la Serbie, la Macédoine et la Grèce, alors que l’Europe fait face à l’une des plus graves crises de réfugiés depuis la Seconde Guerre mondiale. OR_EU ST_HUN ST_SRB ST_MKD ST_GRC CO_EUR 4 2
3333619 fr_DZA_elwata 2014-10-19 11:20:41 Ainsi, les Etats-Unis ont effectué vendredi et samedi quinze frappes aériennes contre les djihadistes de l’organisation de l’EI en Syrie et dix raids en Irak, a annoncé le Commandement militaire américain pour le Proche-Orient et l’Asie centrale (Centcom) cité par l’AFP. ST_USA ST_SYR ST_IRQ LA_east_near CO_ASI_centr 3 2
10786640 fr_FRA_libera 2015-11-14 07:59:48 De Washington à Moscou, des Nations unies à l’Otan, dans toute l’Europe, les responsables ont condamné le carnage qui a fait plus de 120 morts et 200 blessées, vendredi soir à Paris. CA_USA CA_RUS OR_NATO CO_EUR CA_FRA 3 2

2.2.7 Save geographically annotated corpus

saveRDS(qd,"data/corpus/qd_mycorpus_geo.RDS")
  paste("Size of resulting file = ",round(file.size("data/corpus/qd_mycorpus_geo.RDS")/1000000,3), "Mo")
[1] "Size of resulting file =  6.159 Mo"

3 Thematic tags

The aim of this section is to add to the quanteda corpus different metadata related to “thematic” tags, i.e. topics that can be considered as related to the same types of events, without considering space and time. In other words, a thematic tag should not be related to an isolated crisis or an isolated country but to something more general that could take place anytime and anywhere. For example, the earthquake in Nepal in April 2015 is not a topic but an event which belongs to the general category of news about seismic risk. The Ebola outbreak in western Africa is different because it is not an isolated event but a group of events spreading through space and time; it can therefore be considered as a sub-topic of the more general topic of pandemics. Finally, the so-called “migrant crisis” that occurred in Europe in 2015 is certainly a mega-event, but it is probably not a good idea to choose it as a topic. It is more interesting in this case to look for all news associated with the topic of “human mobility across borders” and to search for news mentioning migrants, refugees or asylum seekers in all countries and at all periods of time. The problem of a topic's definition is therefore firstly a conceptual problem and only secondarily a question of keywords and dictionaries. If the topic is not well defined, it will not be possible to propose a consistent and coherent method for identifying the news related to the topic.

The thematic tags will be added to the corpus previously annotated with geographical units:

qd <-readRDS("data/corpus/qd_mycorpus_geo.RDS")


3.1 Load tagging function

We can use exactly the same tagging program as the one we used previously for the tagging of geographical entities. The only condition is to prepare a data.frame with three columns:

  • lang : the language associated to the dictionary
  • code : the tag to be used
  • label: the words or groups of words to be recognized

3.2 The pandemic topic

3.2.1 Dictionary

We decide here to use lower case transformation. We use a star for the words that can take a plural form.

label <- c("épidémie*", "pandémie*", "virus", "oms", "ébola", "ebola",  "h1n1","sras", "chikungunya", "choléra", "peste")
  code  <- rep("pand", length(label))
  lang  <- rep("fr", length(label))
  dict_pande <- data.frame(code,lang,label)
  kable(dict_pande)
code lang label
pand fr épidémie*
pand fr pandémie*
pand fr virus
pand fr oms
pand fr ébola
pand fr ebola
pand fr h1n1
pand fr sras
pand fr chikungunya
pand fr choléra
pand fr peste
frcomps<-c("virus informatique")
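With quanteda's default valuetype = "glob", the star patterns also match inflected forms; a minimal sketch on an invented sentence:

```r
library(quanteda)

# After lower-casing, "épidémie*" matches the plural and "virus" matches exactly
toks <- tokens(tolower("Deux épidémies et un virus"))
pand_dict <- dictionary(list(pand = c("épidémie*", "virus")))
found <- tokens_lookup(toks, pand_dict)
as.list(found)[[1]]  # "pand" "pand"
```

The compound "virus informatique" defined above is joined into a single token by extract_tags() before the lookup, so that computer viruses are not tagged as pandemics.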

3.2.2 Annotation

qd <- extract_tags (qd = qd,
                       lang="fr",
                       dict = dict_pande,
                       code = "code",
                      tagsname = "pand",
                       split = c("'","’","-"),
                       comps = frcomps,
                       tolow = TRUE)
  
  table(qd$nbpand)

3.2.3 Visualization

x<-data.table(docvars(qd))
  x$tag<-x$nbpand !=0
  tab<-x[,.(tot=.N),by=.(month,tag, who)]
  tab<-tab[tab$tag==TRUE,]
  tab$month<-as.Date(tab$month)
  
         
         p<-ggplot(tab, aes(x=month,fill =who, y=tot))+
           geom_bar(stat="identity")+
           ggtitle(label ="Pandemic : distribution of tags by month and media",
                    subtitle = "1st Jan 2014 to 31st Dec. 2015")
  p

3.3 The quake topic

3.3.1 Dictionary

We decide here to use lower case transformation. We use a star for the words that can take a plural form.

label <- c("séisme*", "sismiqu*", "tremblement de terre", "tremblements de terre", "tsunami*", "USGS","échelle de Richter", "secousse tellurique")
  code  <- rep("quak", length(label))
  lang  <- rep("fr", length(label))
  dict_quak <- data.frame(code,lang,label)
  kable(dict_quak)
code lang label
quak fr séisme*
quak fr sismiqu*
quak fr tremblement de terre
quak fr tremblements de terre
quak fr tsunami*
quak fr USGS
quak fr échelle de Richter
quak fr secousse tellurique
frcomps<-c("séisme politique")

3.3.2 Annotation

qd <- extract_tags (qd = qd,
                       lang="fr",
                       dict = dict_quak,
                       code = "code",
                      tagsname = "quak",
                       split = c("'","’","-"),
                       comps = frcomps,
                       tolow = TRUE)
  
  table(qd$nbquak)

3.3.3 Visualization

x<-data.table(docvars(qd))
  x$tag<-x$nbquak !=0
  tab<-x[,.(tot=.N),by=.(month,tag, who)]
  tab<-tab[tab$tag==TRUE,]
  tab$month<-as.Date(tab$month)
  
         
         p<-ggplot(tab, aes(x=month,fill =who, y=tot))+
           geom_bar(stat="identity")+
           ggtitle(label ="Earthquakes : distribution of tags by month and media",
                    subtitle = "1st Jan 2014 to 31st Dec. 2015")
  p

3.4 The Migrant & Refugees topic

3.4.1 Dictionary

In this last example, we introduce two subtopics corresponding to migrants and to refugees or asylum seekers. The objective is to analyze a possible semantic shift.

label <- c("migrant*", "émigrant*", "immigrant*", "migrat*", "réfugié*", "demandeur d'asile")
  code  <- c("migr","migr","migr","migr", "refu","refu")
  lang  <- rep("fr", length(label))
  dict_mobil <- data.frame(code,lang,label)
  kable(dict_mobil)
code lang label
migr fr migrant*
migr fr émigrant*
migr fr immigrant*
migr fr migrat*
refu fr réfugié*
refu fr demandeur d’asile
frcomps<-c("oiseaux migrateurs")

3.4.2 Annotation

qd <- extract_tags (qd = qd,
                       lang="fr",
                       dict = dict_mobil,
                       code = "code",
                      tagsname = "mobil",
                       split = c("'","’","-"),
                       comps = frcomps,
                       tolow = TRUE)
  
  table(qd$nbmobil)

3.4.3 Visualization

x<-data.table(docvars(qd))
  x$tag<-as.factor(x$mobil)
  levels(x$tag)<-c(NA,"Migrants","Migrants & Refugees","Refugees","Migrants & Refugees")
  tab<-x[,.(tot=.N),by=.(month,tag)]
  tab<-tab[is.na(tab$tag)==FALSE,]
  tab$month<-as.Date(tab$month)
  
         
         p<-ggplot(tab, aes(x=month,fill =tag, y=tot))+
           geom_bar(stat="identity")+
           ggtitle(label ="Mobility : distribution of tags by month and qualification",
                    subtitle = "1st Jan 2014 to 31st Dec. 2015")
  p

We store the quanteda file which combines geographical and thematic tags:

saveRDS(qd,"data/corpus/qd_mycorpus_geo_top.RDS")
  paste("Size of resulting file = ",round(file.size("data/corpus/qd_mycorpus_geo_top.RDS")/1000000,3), "Mo")
[1] "Size of resulting file =  7.039 Mo"

4 Hypercubes

This section is based on the TELEMAC application elaborated during the H2020 project ODYCCEUS and presented in a paper published in the journal Frontiers, available at https://analytics.huma-num.fr/Claude.Grasland/telemac/

Our objective is to build a hypercube organized along several dimensions. As an example, suppose we are interested in the analysis of the migrants and refugees crisis (what) in different newspapers (who) at different periods of time (when), and we want to explore which countries are mentioned (where) and possibly associated together (where1.where2). Finally, we want to distinguish, inside the news, how the results change depending on whether we consider the title alone or the first, second and third sentences of the description (order).

4.1 Definition of dimensions

To illustrate these different options, we can look at the example of a news item published by the Algerian newspaper El Watan on 16 September 2015, divided into a title and three sentences of description.

qd<-readRDS("data/corpus/qd_mycorpus_geo_top.RDS")
  
  examp<-corpus_subset(qd,docid(qd) == 9486265)
  
  kable(paste(examp))
x
Crise des réfugiés en Europe : Vers un conseil des chefs d’Etat et de gouvernement de l’UE .
L’Allemagne, l’Autriche et la Slovaquie ont appelé, hier, à la tenue, dès la semaine prochaine, d’un conseil européen des chefs d’Etat et de gouvernement consacré à la crise migratoire.
Après l’échec lundi de la réunion extraordinaire à Bruxelles des ministres de l’Intérieur de l’Union européenne (UE) sur la répartition des réfugiés par quotas, l’Allemagne, l’Autriche et la Slovaquie ont appelé hier à la tenue, dès la semaine prochaine, d’un conseil européen des chefs d’Etat et de gouvernement consacré à la crise migratoire, rapporte l’AFP.
«C’est un problème pour l’Union européenne dans son ensemble, c’est pourquoi nous nous sommes prononcés pour la tenue la semaine prochaine d’un conseil extraordinaire de l’UE», a déclaré la chancelière allemande lors d’une conférence de presse avec son homologue autrichien Werner Faymann.

Thanks to the previous operations of geographical and topical tagging, we can propose a simplified table where the text of the news has been removed and where we keep only the information of interest for the aggregation procedure.

examp$id<-as.character(docid(examp))
  dtexamp<-data.table(tidy(examp)) %>% select(id=id, order = order, who = who, when=day, what=mobil, where1 = states, where2=states)
  kable(dtexamp)
id order who when what where1 where2
9486265 1 fr_DZA_elwata 2015-09-16 refu
9486265 2 fr_DZA_elwata 2015-09-16 migr DEU AUT SVK DEU AUT SVK
9486265 3 fr_DZA_elwata 2015-09-16 refu migr BEL DEU AUT SVK BEL DEU AUT SVK
9486265 4 fr_DZA_elwata 2015-09-16

The hypercube is the result of an aggregation of foreign news according to several dimensions:

  • who : this dimension is related to the variable which describes the media outlet which published the RSS feed. Each source is identified by a code ll_sss_xxxxxx where ll is the language, sss is the ISO3 code of the country and xxxxxx the name of the media. For instance, the RSS feed produced by the Algerian newspaper El Watan is identified by the code who = fr_DZA_elwata. Starting from there, it is possible to aggregate the data by group of languages (e.g. compute the indicators for all the French-speaking newspapers) or by country (compute the indicators for all the media outlets located in Algeria).

  • when : this dimension describes the day when an article of the RSS feed has been published, according to a reference time zone (Paris in the present case). Starting from the day, the data can be further aggregated over different time spans: weeks, months, quarters or years. For instance, by choosing to work on monthly aggregated data, the period of observation for the news presented as example becomes when = 2015-09-01. If we choose a division in weeks, we have to decide whether the week starts on Sunday (default option in R) or on Monday (option adopted in the present case).

  • where1 and where2 : this dual dimension is associated with the cross-list of foreign countries detected by the country dictionary in the news. For example, the first sentence of the description of our example (“L’Allemagne, l’Autriche et la Slovaquie ont appelé, hier, à la tenue, dès la semaine prochaine, d’un conseil européen des chefs d’Etat et de gouvernement consacré à la crise migratoire.”) has produced a list of three places (DEU, AUT, SVK) associated with a cross-list of nine pairs of places (AUT-AUT, AUT-DEU, AUT-SVK, DEU-DEU, DEU-AUT, DEU-SVK, SVK-AUT, SVK-DEU, SVK-SVK), where each pair receives a weight of 1/9. It is important to keep in mind that the countries where the media are located (mentioned in the who dimension) should be excluded from the list if we decide to work only on foreign news.

  • what : in general, this dimension can be described as a boolean value (TRUE/FALSE) which specifies whether or not the news is associated with the topic of interest. For example, the title and the first two sentences of our example are associated with the topic of international mobility, but not the third sentence, where the expected keywords have not been found. If we have introduced subtopics, the situation is more complex because the news can be associated with several subtopics (just as it can be associated with several states). For example, the second sentence of the description (“Après l’échec lundi de la réunion extraordinaire à Bruxelles des ministres de l’Intérieur de l’Union européenne (UE) sur la répartition des réfugiés par quotas, l’Allemagne, l’Autriche et la Slovaquie ont appelé hier à la tenue, dès la semaine prochaine, d’un conseil européen des chefs d’Etat et de gouvernement consacré à la crise migratoire, rapporte l’AFP”) is associated with 2 subtopics (refu, migr) and 4 countries (BEL, AUT, DEU, SVK). It will therefore be broken into 2 x 4 x 4 = 32 pieces of information, each of them receiving a weight of 1/32.

  • order : to build the hypercube, it is possible to work on text units of different sizes: (order = 1) the title or the first sentence, or (order = 2, 3, 4, ...) the title with the selected number of sentences of the description. This parameter is important because some results, especially regarding the spatial dimension of the analysis (where), are more noticeable on longer texts. In our example, it is clear that the conclusions would be different if we had decided to focus only on the title, which does not mention any country and is only associated with the subtopic of refugees.
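The cross-list mechanics described for where1, where2 and what can be sketched in a few lines of base R. This is only an illustration of the weighting rule, not the project's own code; it rebuilds the 2 subtopics × 4 × 4 countries split of the second sentence of the description:

```r
# Illustrative sketch: one text unit tagged with 2 subtopics and 4 countries
# is split into every (what, where1, where2) combination, with equal weights.
topics <- c("refu", "migr")
states <- c("BEL", "DEU", "AUT", "SVK")
pieces <- expand.grid(what = topics, where1 = states, where2 = states,
                      stringsAsFactors = FALSE)
pieces$wgt <- 1 / nrow(pieces)  # each piece carries an equal share of the news item
nrow(pieces)                    # 32 combinations
sum(pieces$wgt)                 # the weights sum back to exactly 1 news item
```

Whatever the number of tags, the disaggregated pieces of a given text unit always add up to one news item, which keeps the aggregated counts interpretable.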

4.2 Aggregation function

The elaboration of the hypercube is based on the crossing of all dimensions, with one line for each unique combination. To do that, we have written a specific function that combines all six dimensions but can easily be adapted if fewer dimensions are needed.

#' @title create an hypercube
  #' @name hypercube
  #' @description create a network of interlinked states
  #' @param corpus a corpus of news in quanteda format
  #' @param order an order of sentences in the news
  #' @param who the source dimension
  #' @param when the time dimension
  #' @param timespan aggregation of time
  #' @param what a list of topics
  #' @param where1 a list of states
  #' @param where2  a list of states
  
  
  hypercube   <- function( corpus = qd,
                          order = "order",
                          who = "source",
                          when = "when",
                          timespan = "week",
                          what = "what",
                          where1 = "where1",
                          where2 = "where2")
  {
  
  
    
  # prepare data
  
    don<-docvars(corpus)
    
    df<-data.table(id     = docid(corpus),
                   order  = don[[order]],
                   who    = don[[who]],
                   when   = don[[when]],
                   what   = don[[what]],
                   where1 = don[[where1]],
                   where2 = don[[where2]])
  
    # adjust id
   df$id<-paste(df$id,"_",df$order,sep="")
   
  # change time span
    df$when<-as.character(cut(as.Date(df$when), timespan, start.on.monday = TRUE))
  
  # unnest where1
    df$where1[df$where1==""]<-"_no_"
    df<-unnest_tokens(df,where1,where1,to_lower=F)
    
  # unnest where2
    df$where2[df$where2==""]<-"_no_"
    df<-unnest_tokens(df,where2,where2,to_lower=F) 
    
  # unnest what
    df$what[df$what==""]<-"_no_"
    df<-unnest_tokens(df,what,what,to_lower=F) 
    
  
  
  # Compute weight of news
    newswgt<-df[,list(wgt=1/.N),list(id)]
    df <- merge(df,newswgt, by="id")
  
  
  # ------------------------ Hypercube creation --------------------#
    
    
  # Aggregate
    hc<- df[,.(tags = .N, news=sum(wgt)) ,.(order,who, when,where1,where2, what)]
    
  # Convert date to time
    hc$when<-as.Date(hc$when)
    
  # export
    return(hc)
    
  }
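Inside the function, the timespan argument is handled by cut() applied to dates, which maps each day to the first day of its week, month, quarter or year (with weeks starting on Monday here). For instance, for the publication day of the El Watan example:

```r
# The publication day of the example news, aggregated at two time spans:
d <- as.Date("2015-09-16")                            # a Wednesday
as.character(cut(d, "week", start.on.monday = TRUE))  # "2015-09-14" (the Monday)
as.character(cut(d, "month"))                         # "2015-09-01"
```

This is why the when column of the hypercube always contains the first day of the chosen period.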

In order to test the function, we first apply it to the small example of the single news published by El Watan.

hc_examp<-hypercube( corpus   = examp,
                      order    = "order",
                      who      = "who",
                      when     = "when",
                      timespan = "day",
                      what     = "mobil",
                      where1   = "states",
                      where2   = "states")
  kable(hc_examp)
order who when where1 where2 what tags news
1 fr_DZA_elwata 2015-09-16 no no refu 1 1.0000000
2 fr_DZA_elwata 2015-09-16 DEU DEU migr 1 0.1111111
2 fr_DZA_elwata 2015-09-16 DEU AUT migr 1 0.1111111
2 fr_DZA_elwata 2015-09-16 DEU SVK migr 1 0.1111111
2 fr_DZA_elwata 2015-09-16 AUT DEU migr 1 0.1111111
2 fr_DZA_elwata 2015-09-16 AUT AUT migr 1 0.1111111
2 fr_DZA_elwata 2015-09-16 AUT SVK migr 1 0.1111111
2 fr_DZA_elwata 2015-09-16 SVK DEU migr 1 0.1111111
2 fr_DZA_elwata 2015-09-16 SVK AUT migr 1 0.1111111
2 fr_DZA_elwata 2015-09-16 SVK SVK migr 1 0.1111111
3 fr_DZA_elwata 2015-09-16 BEL BEL refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 BEL BEL migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 BEL DEU refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 BEL DEU migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 BEL AUT refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 BEL AUT migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 BEL SVK refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 BEL SVK migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 DEU BEL refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 DEU BEL migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 DEU DEU refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 DEU DEU migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 DEU AUT refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 DEU AUT migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 DEU SVK refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 DEU SVK migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 AUT BEL refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 AUT BEL migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 AUT DEU refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 AUT DEU migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 AUT AUT refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 AUT AUT migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 AUT SVK refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 AUT SVK migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 SVK BEL refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 SVK BEL migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 SVK DEU refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 SVK DEU migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 SVK AUT refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 SVK AUT migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 SVK SVK refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 SVK SVK migr 1 0.0312500
4 fr_DZA_elwata 2015-09-16 no no no 1 1.0000000
  • order = 1 : the title is described by a single line because there is only one subtopic and no states mentioned. The weight of the line is 1.
  • order = 2 : the first sentence of the description is characterized by one subtopic and three different states, which produce 9 lines with a weight of 1/9 ≈ 0.111 news.
  • order = 3 : the second sentence of the description is characterized by two subtopics and four different states, which produce 32 lines with a weight of 1/32 ≈ 0.031 news.
  • order = 4 : the last sentence of the description is characterized by no topic and no states, which produces 1 line with a weight of 1.
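A quick arithmetic check on the table above: however many pieces a text unit is split into, its news weights always sum to one, so the four text units of the example account for exactly 4 news units:

```r
# total news weight contributed by each text unit (order) of the example
wgt <- c(order1 = 1 * 1, order2 = 9 * (1/9), order3 = 32 * (1/32), order4 = 1 * 1)
sum(wgt)  # 4: one news unit per text unit
```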

4.3 Application

Of course, it is not very interesting to transform a single news item into such a large table. But the procedure becomes highly valuable when the aggregation is realized on a large number of news: in this case the number of combinations of dimensions remains limited, and we obtain a synthetic table called hypercube that summarizes all the information extracted from the news in a relatively small object. The computation of a hypercube can take some time and the memory needed for the intermediate step of disaggregation can be important, but the resulting object is small and well suited to a large range of exploration and modeling methods.

In practice, the function, based on the data.table package, appears to be very fast, as we can see in the following example.

hc<-hypercube( corpus   = qd,
                      order    = "order",
                      who      = "who",
                      when     = "when",
                      timespan = "day",
                      what     = "mobil",
                      where1   = "states",
                      where2   = "states")
  
  saveRDS(hc,"data/corpus/hc_mycorpus_states_mobil_day.RDS")
  paste("Size of resulting file = ",round(file.size("data/corpus/hc_mycorpus_states_mobil_day.RDS")/1000000,3), "Mo")
[1] "Size of resulting file =  0.285 Mo"

We can see that the resulting object is rather small (0.3 Mo), which will make it easier to produce visualizations based on the crossing of the different dimensions.

If we want to work only at the month level, the hypercube is even smaller:

hc<-hypercube( corpus   = qd,
                      order    = "order",
                      who      = "who",
                      when     = "when",
                      timespan = "month",
                      what     = "mobil",
                      where1   = "states",
                      where2   = "states")
  
  saveRDS(hc,"data/corpus/hc_mycorpus_states_mobil_month.RDS")
  paste("Size of resulting file = ",round(file.size("data/corpus/hc_mycorpus_states_mobil_month.RDS")/1000000,3), "Mo")
[1] "Size of resulting file =  0.149 Mo"

5 Hypercubes exploration

The different dimensions of a hypercube can be analyzed through different aggregations of its dimensions, leading to different tables that allow different modes of visualization. Each function is named according to the dimensions that are combined.

source("pgm/hypernews_functions_V6.R")

We start here from the situation of a researcher interested in the topic of human mobility, and we load the hypercube elaborated in the previous section. We decide to analyze the topic without distinguishing between migrants and refugees.

hc <- readRDS("data/corpus/hc_mycorpus_states_mobil_month.RDS")

5.1 WHAT

The first question (WHAT) is the evaluation of the proportion of news related to the topic.

res_what <- what(hc = hc,
               subtop = NA,
               title = "Topic news")
  res_what$table
    what   news       pct
  1: FALSE 112482 97.272476
  2:  TRUE   3154  2.727524
res_what$plotly

The table indicates that 3,154 news items were associated with the topic, which represents 2.73% of the total.
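The what() function comes from the script loaded above, but the underlying computation is a single aggregation of the hypercube. A base-R sketch on a toy hypercube (the figures are made up for the illustration; in the real hypercube the what column holds "_no_" when no topic keyword matched the text unit):

```r
# Toy hypercube rows (made-up figures for illustration)
hc_toy <- data.frame(what = c("_no_", "migr", "refu", "_no_"),
                     news = c(90, 4, 3, 3))
hc_toy$topic <- hc_toy$what != "_no_"          # TRUE when any subtopic was tagged
tab <- aggregate(news ~ topic, data = hc_toy, FUN = sum)
tab$pct <- 100 * tab$news / sum(tab$news)      # share of news related to the topic
tab
```

Summing the fractional news weights by topic/non-topic and converting to percentages is all the WHAT question requires.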

5.2 WHO.WHAT

The second question (WHO.WHAT) explores how interest in the topic varies across the different media of the corpus.

5.2.1 Example

res_who_what<- who.what(hc=hc, 
                          test = FALSE,
                          minsamp = 20,
                          mintest = 5,
                          title = "Topic news by media - Salience")
  
  kable(res_who_what$table)
who trial success null.value estimate salience chi2 p.value index
fr_FRA_libera 28827 607 0.0273 0.02106 0.7714286 42.08 1.00000 0.7714286
fr_DZA_elwata 11581 292 0.0273 0.02521 0.9234432 1.82 0.91137 0.9234432
fr_BEL_derheu 16358 349 0.0273 0.02134 0.7816850 21.69 1.00000 0.7816850
fr_FRA_figaro 22999 1005 0.0273 0.04370 1.6007326 232.26 0.00000 1.6007326
fr_BEL_lesoir 26057 652 0.0273 0.02502 0.9164835 5.01 0.98737 0.9164835
fr_DZA_xpress 9814 249 0.0273 0.02537 0.9293040 1.30 0.87310 0.9293040
res_who_what$plotly 
res_who_what<- who.what(hc=hc, 
                          test = TRUE,
                          minsamp = 5,
                          mintest = 1,
                          title = "Topic news by media - Significance")
  res_who_what$plotly

The analysis reveals a clear over-representation of the topic in the French newspaper Le Figaro (4.37% of its news) compared to the other media (2.1% to 2.5%).
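The salience index reported by who.what() can be read as the ratio between the medium's observed share of topic news (estimate = success / trial) and the corpus-wide share (null.value). Recomputing it for Le Figaro from the table above:

```r
# Salience of the topic for Le Figaro, from the who.what table:
est  <- 1005 / 22999   # observed share of topic news (success / trial)
null <- 0.0273         # expected share over the whole corpus (null.value)
est / null             # ~1.6, i.e. a clear over-representation
```

A salience close to 1 means the medium covers the topic at the corpus-average rate; values well above or below 1 flag over- or under-representation, which the chi2 and p.value columns then test for significance.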

5.3 WHEN.WHAT

The third question (WHEN.WHAT) concerns the evolution through time of the interest of all the media of the corpus in the topic.

5.3.1 Example

res_when_what<- when.what(hc=hc, 
                            test=FALSE,
                            minsamp=10,
                            mintest=5,
                            title = "Topic news by month - Salience")
  
  res_when_what$plotly
res_when_what<- when.what(hc=hc, 
                            test=TRUE,
                            minsamp=10,
                            mintest=5,
                            title = "Topic news by month - Significance")
  res_when_what$plotly

The analysis reveals clear discontinuities in the timeline of the topic. We start with a low level (0.5% to 1.2%) from January 2014 to March 2015, followed by a sudden jump in April-June 2015 (3% to 5%) and a major peak in September 2015 (15.8% of news). At the end of the period, the level is clearly higher than at the beginning.

5.4 WHERE.WHAT

The fourth question (WHERE.WHAT) analyzes which countries are most associated with the topic of interest. We therefore exclude the news where no country is mentioned, and we analyze for each country the proportion of news associated with the topic.

5.4.1 Example

map<-readRDS("data/map/world_ctr_4326.Rdata")
  hc2<-hc %>% filter(where1 !="_no_", where2 !="_no_")
  res_where_what<- where.what(hc=hc2,
                              test=FALSE,
                              map = map, 
                              minsamp=10,
                              mintest =5,
                              title = "Topic news by states - Salience")
  res_where_what$plotly
res_where_what<- where.what(hc=hc2,
                              test=TRUE,
                              minsamp=10,
                              map = map, 
                              mintest =5,
                              title = "Topic news by states - Significance")
  res_where_what$plotly

The analysis reveals that some countries are “specialized” in the topic during the period of observation. For example, 53.5% of the news about Hungary were associated with the question of migrants and refugees, which is obviously related to the mediatization of the border fence built by Viktor Orbán in 2015. Other countries are characterized, on the contrary, by an under-representation of the topic, like the USA, where the topic is associated with only 0.7% of news. But the situation would change after Donald Trump's election, whose border-wall project would dramatically increase the number of news about the USA and migrants.

5.5 WHEN.WHO.WHAT

Despite our limited sample size, we can try to ask more complex questions that combine three dimensions. We can for example examine the synchronization of the media through time on the topic of interest (WHEN.WHO.WHAT).

5.5.1 Example

res_when_who_what<- when.who.what(hc=hc,
                                    test = FALSE,
                                    minsamp = 20,
                                    mintest = 5,
                                    title = "Topic news by month and by media - Salience")
  res_when_who_what$plotly
res_when_who_what<- when.who.what(hc=hc,
                                    test = TRUE,
                                    minsamp = 20,
                                    mintest = 5,
                                    title = "Topic news by month and by media - Significance")
  res_when_who_what$plotly

The figure reveals a global synchronization of the media agenda concerning the topic, especially around the major peak of interest in September 2015. The first period of crisis, in April 2015, is also visible in all media, with the exception of the Belgian newspaper La Dernière Heure, which apparently did not cover the dramatic boat sinkings in the Mediterranean more than usual. Another interesting difference can be observed for the two Algerian newspapers, which were characterized by a higher coverage of the topic during 2014.

5.6 WHERE.WHO.WHAT

Another combination of the three dimensions can be realized by exploring whether some countries are more often mentioned by some media in relation with the topic of interest. In other words, do we observe a geographical synchronization of the media agendas?

5.6.1 Example

hc2<-hc %>% filter(where1 !="_no_", where2 !="_no_") %>% mutate(who=substr(who,4,6))
  res_where_who_what<- where.who.what(hc= hc2,
                                      maxloc= 10,
                                      test=FALSE,
                                      minsamp=5,
                                      mintest=2,
                                      title = "Topic news by media and by states - Salience")
  res_where_who_what$plotly
res_where_who_what<- where.who.what(hc= hc2,
                                      maxloc= 10,
                                      test=TRUE,
                                      minsamp=5,
                                      mintest=2,
                                      title = "Topic news by media and by states - Significance")
  res_where_who_what$plotly

The analysis reveals that some countries, like Turkey or Greece, are systematically associated with the topic by all media. But other countries, like Syria, show variable patterns of interest in relation with the topic: it is clearly more associated in Algeria, neutral in Belgium and less associated in France.

5.7 WHEN.WHERE.WHAT

The last combination of three dimensions concerns the variation through time of the association between the topic and the countries. It can typically reveal the effect of a dramatic event occurring in one country at a given period. Unfortunately, the sample is too limited in size for an in-depth exploration of crises, and we have to limit our example to a comparison of the eight quarters of the period.

5.7.1 Function

5.7.2 Example

hc2<-hc %>% filter(where1 !="_no_", where2 !="_no_") %>% mutate(when=cut(when, breaks="quarter"))
  res_when_where_what<- when.where.what(hc=hc2,
                                      maxloc= 10,
                                      test=FALSE,
                                      minsamp=5,
                                      mintest=2,
                                      title = "Topic news by quarter and by states - Salience")
  res_when_where_what$plotly
res_when_where_what<- when.where.what(hc=hc2,
                                      maxloc= 10,
                                      test=TRUE,
                                      minsamp=5,
                                      mintest=2,
                                      title = "Topic news by quarter and by states - Significance")
  res_when_where_what$plotly

The analysis confirms that, in the majority of cases, the strongest association of countries with the topic took place in 2015, with a major peak in the third quarter (July-September). But some interesting exceptions can be observed, in particular Italy, which was associated with the question of migrants and refugees earlier. The analysis remains difficult here, however, because of the high level of time aggregation.

Bibliography

BARNIER, Julien, 2021. rmdformats: HTML Output Formats and Templates for 'rmarkdown' Documents [online]. S.l.: s.n. Available at: https://github.com/juba/rmdformats.
R CORE TEAM, 2020. R: A Language and Environment for Statistical Computing [online]. Vienna, Austria: R Foundation for Statistical Computing. Available at: https://www.R-project.org/.
XIE, Yihui, 2020. knitr: A General-Purpose Package for Dynamic Report Generation in R [online]. S.l.: s.n. Available at: https://CRAN.R-project.org/package=knitr.

Annexes

Session info

setting value
version R version 4.0.2 (2020-06-22)
os macOS Catalina 10.15.7
system x86_64, darwin17.0
ui X11
language (EN)
collate fr_FR.UTF-8
ctype fr_FR.UTF-8
tz Europe/Paris
date 2021-12-02
package ondiskversion source
data.table 1.13.0 CRAN (R 4.0.2)
dplyr 1.0.2 CRAN (R 4.0.2)
ggplot2 3.3.3 CRAN (R 4.0.2)
ggraph 2.0.4 CRAN (R 4.0.2)
knitr 1.34 CRAN (R 4.0.2)
lubridate 1.7.9.2 CRAN (R 4.0.2)
plotly 4.9.2.2 CRAN (R 4.0.2)
quanteda 3.0.0 CRAN (R 4.0.2)
RColorBrewer 1.1.2 CRAN (R 4.0.2)
readr 1.4.0 CRAN (R 4.0.2)
readtext 0.80 CRAN (R 4.0.2)
rmarkdown 2.11 CRAN (R 4.0.2)
rzine 0.1.0 gitlab ()
sf 0.9.8 CRAN (R 4.0.2)
stringr 1.4.0 CRAN (R 4.0.2)
tidygraph 1.2.0 CRAN (R 4.0.2)
tidytext 0.2.6 CRAN (R 4.0.2)
visNetwork 2.0.9 CRAN (R 4.0.2)

Citation

@Manual{ficheRzine,
    title = {Titre de la fiche},
    author = {{Auteur.e.s}},
    organization = {Rzine},
    year = {202x},
    url = {http://rzine.fr/},
  }


Glossary